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Abstract 

Background: Discerning the traits evolving under neutral conditions from those traits evolving rapidly because of 
various selection pressures is a great challenge. We propose a new method, composite selection signals (CSS), 
which unifies the multiple pieces of selection evidence from the rank distribution of its diverse constituent tests. 
The extreme CSS scores capture highly differentiated loci and underlying common variants hauling excess 
haplotype homozygosity in the samples of a target population. 

Results: The data on high-density genotypes were analyzed for evidence of an association with either polledness or 
double muscling in various cohorts of cattle and sheep. In cattle, extreme CSS scores were found in the candidate 
regions on autosome BTA-1 and BTA-2, flanking the POLL locus and MSTN gene, for polledness and double muscling, 
respectively. In sheep, the regions with extreme scores were localized on autosome OAR-2 harbouring the MSTN gene 
for double muscling and on OAR-10 harbouring the RXFP2 gene for polledness. In comparison to the constituent tests, 
there was a partial agreement between the signals at the four candidate loci; however, they consistently identified 
additional genomic regions harbouring no known genes. Persuasively, our list of all the additional significant CSS 
regions contains genes that have been successfully implicated to secondary phenotypic diversity among several 
subpopulations in our data. For example, the method identified a strong selection signature for stature in cattle 
capturing selective sweeps harbouring UQCC-GDF5 and PLAGl-CtiCtiD/ gene regions on BTA-13 and BTA-14, 
respectively. Both gene pairs have been previously associated with height in humans, while PLAGl-ChlChlD7 has 
also been reported for stature in cattle. In the additional analysis, CSS identified significant regions harbouring 
multiple genes for various traits under selection in European cattle including polledness, adaptation, metabolism, 
growth rate, stature, immunity, reproduction traits and some other candidate genes for dairy and beef production. 

Conclusions: CSS successfully localized the candidate regions in validation datasets as well as identified previously 
known and novel regions for various traits experiencing selection pressure. Together, the results demonstrate the utility 
of CSS by its improved power, reduced false positives and high-resolution of selection signals as compared to individual 
constituent tests. 
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Background 

Genetics research has increased rapidly with availability 
of high throughput molecular biology tools and analytical 
approaches [1]. Recent molecular genetics techniques 
combined with large scale in silico analysis of genetic poly- 
morphism data have provided insights to many questions 
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about the origin of species [2], evolution [3], co-evolution 
and selection [4], domestication [5], genetic control of 
adaptation and diseases [6-8], and genetic diversity [9,10] 
for a wide range of species. More recently, identification 
of chromosomal regions that contain signatures of selec- 
tion has been helpful to understand various mechanisms 
of adaptation, domestication and selection for important 
traits of various domestic species [11-21]. 

Evidence of selection can be gained from the measures 
of population differentiation, the allele frequency spectrum. 
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linkage disequilibrium (LD) and haplotype structures 
[22,23]. Multiple methods have been developed for 
detecting selection signatures from genomic sequences 
and single nucleotide polymorphism (SNP) data [24,25]. 
Popular methods to capture selection evidence among 
populations from genetic polymorphism data include fix- 
ation index (Fst) [26,27], change in derived allele frequen- 
cies (ADAF) [23], allele frequency differences [28], long 
range haplotype (LRH) tests based on the extended 
haplotype homozygosity (EHH) statistic [29] including 
the across population extended haplotype homozygosity 
(XP-EHH) [22] and Rsb [30]. The specificity of each 
selection test statistic is limited to test certain aspects 
of selective forces operating under various models of 
natural and artificial selection. Hence, various selection 
tests being used often provide differing results for the 
same genomic dataset and likely none of these can 
exclusively provide a definite conclusion about the 
selective hypotheses [31]. 

Populations undergoing directional or divergent selec- 
tion for specific traits are expected to exhibit signals of 
selection at the underlying genomic regions when mea- 
sured by several selection tests [32]. Therefore, a com- 
bination of multiple strategies can be a robust approach 
in localizing such selected regions and correlating them 
with phenotypic variation. Several approaches to com- 
bine multiple summary statistics have been implemented 
that improve the power of detecting selection signatures 
[16,23,31,33,34]. Grossman et al. [23] developed a 
Bayesian estimator, composite of multiple signals (CMS), 
that combines several statistics to localize causal variants 
of positive selection. CMS requires extensive simulations 
and knowledge of the population genetic history to 
explore selection events under robust models with their 
underlying assumptions [15]. Success of CMS depends on 
the availability of very dense SNP data (for example, > 3 
million SNPs in the human 1000 Genomes Project) re- 
quired to approximate all the genome-wide functional 
variants. Lin et al. [31] and Pavlidis et al. [33] used ma- 
chine learning methods implementing boosting and 
support vectors, respectively, which combines multiple 
statistics to maximize their joint predictive performance. 
They too require prior information from the estimates of 
population genetic diversity along with powerful computa- 
tion platforms. Other efforts have also been made by com- 
bining selection signatures with association analysis in 
multiple species, however, these require information of 
phenotypes on individuals and in some cases also about 
their progeny [12,34,35]. Recently, Utsunomiya et al. [16] 
employed the Stouffer weighted Z-method [36] for 
combining ^-values of several selection tests in their so 
called Meta-SS (meta-analysis of selection signals). Their 
assumptions to retrieve j^-values directly from the test sta- 



tistics require that each constituent test follow (approxi- 
mately) a normal distribution, centred on zero under the 
null hypothesis if no selection. The implementation of 
Meta-SS is, therefore, limited to selected tests and incom- 
patible on some popular selection tests such as Fgr where 
the distribution (under the null hypothesis) is not known. 
The limitations and complexity of methods, prior in- 
formation, high-density genotypes and powerful com- 
putational resources required to implement available 
combining approaches leaves researchers with limited 
resources at a disadvantage. 

Understanding the genetic control of heritable pheno- 
types is decisive to implement strategies for the rapid 
improvement in the qualitative and quantitative features 
of any domesticated species. Owing to the high genetic 
diversity in cattle and sheep, with over 800 and 1400 
breeds, respectively, and substantive known factors for 
shaping their genetic diversity, they have been exten- 
sively used as model species for exploring selection sig- 
natures [11-21,32,37-42]. 

In general, genetically alike populations are expected 
to share genetic polymorphism at the genomic regions 
carrying genes for common phenotypes, whereas, genet- 
ically isolated populations may have uniquely positioned 
or divergent patterns of polymorphism on the genome 
[11,15,43]. Combining genotypic data on multi-population 
panels for identical traits has been used successfully to es- 
timate the genomic breeding values and genomic selection 
[44,45], local adaptation [43], phylogeography and breed- 
ing history [11,46], and association mapping [47]. There- 
fore, detection of signatures of strong selection can be 
boosted by combining samples from multiple breeds 
based on known traits and compare such multi-breed 
populations for the contrasting phenotypes [12,15,48]. 
Across phenotypic groups, the contrast in genetic vari- 
ation at the putative genomic regions increases the likeli- 
hood of capturing the selection signatures linked to the 
traits of interest. Within groups, the genome-wide genetic 
diversity between multiple breeds will lower background 
noise (false positive signals) which have accumulated con- 
founding genetic patterns due to the demographic history 
of breeds or by the random genetic drift [47]. 

In principle, a simple method to combine outputs 
from separate tests based on their statistical distributions 
can be used to increase the accuracy of linking geno- 
types (genomic regions) with phenotypes without prior 
information on population history, individual pheno- 
types or genetic relationships. Here we present an im- 
provement in the trait-specific genome-wide scans based 
on SNP data to map selection signatures by unifying 
multiple information from: i) evidence of selection, and 
ii) phenotypically alike populations. We developed a 
composite index of selection signatures: composite 
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selection signals (CSS), and tested this against pheno- 
types controlled by known major genes in cattle and 
sheep. In addition, we investigated European and African 
Bos taurus cattle to identify the signatures of selection in 
geographically isolated populations. 

Methods 

DNA samples and genetic polymorphism data 

Utility of the composite selection signal was tested in 
cattle and sheep by analyzing data available from various 
published studies on both species. To add power by in- 
creasing the sample size and to maximize the range of 
breeds and animals within breeds, samples collected by 
independent research groups were merged. Cattle data 
consisted of 1,096 animals representing 56 cattle breeds 
as described in previous studies [3,10,39,49]. Genetic re- 
lationships from the genome-wide SNPs were estimated 
by computing a genome-wide IBS matrix using PLINK 
[50] to identify and remove duplicate samples across 
multiple datasets of cattle. The sheep dataset consisted 
of 2,803 animals from 74 breeds [11]. The samples and 
breeds of cattle and sheep included in this study are 
listed in (Additional file 1: Table SI) and (Additional file 2: 
Table S2), respectively. 

SNP genotypes generated in previous studies on cattle 
[3,10,39,49] and sheep [11] genotyped with the Illumina 
BovineSNP50 chip and Illumina OvineSNP50 chip assays, 
respectively, were used in the present analysis. After 
quality control, 38,610 and 47,502 autosomal SNPs 
were retained for cattle and sheep, respectively (Additional 
file 3: Table S3), and the final number of heterozygous 
SNPs (minor allele frequency (MAF) > 0.01) in each data- 
set is given in Table 1. Imputation of sporadic missing 
genotypes and haplotype phasing was performed with 
BEAGLE 3.3 [51]. Ancestral alleles were inferred for cattle 
genotype data using information from Matukumalli et al. 
[52] and, when possible, using information from the geno- 
types of three out-group species (bison, buffalo and yak) 
from Decker et al. [3]. All SNPs were mapped on the 
UMD3.1 bovine genome assembly (http://www.cbcb.umd. 
edu/research/bos_taurus_assembly) and OARvl.O ovine 
genome assembly (http://www.livestockgenomics.csiro.au/ 
sheep/oar l.O.php) for the corresponding species. 

Phenotype data 

Two subsets from both the cattle and the sheep data, 
collectively called as validation datasets (A-D), were ex- 
tracted based on traits known to be under control of a 
major autosomal gene, namely double muscling (increased 
skeletal muscle mass) and polled (absence of horns) phe- 
notypes (Table 1). In cattle, the dataset A consisted of ani- 
mals of seven polled breeds and seven horned breeds. The 



dataset B of cattle consisted of animals from three double 
muscle breeds and 14 normal muscle beef breeds. In 
sheep, the dataset C contained animals from 37 naturally 
polled sheep breeds and 36 horned sheep breeds and the 
dataset D had data on animals from three breeds known 
to be double muscled and 71 breeds without the double 
muscle phenotype. 

Candidate genes for the two traits in validation datasets 
(A-D) of both species are described as follows: 

Polledness in cattle 

POLL locus is located at the proximal end of bovine 
autosome 1 (BTA-1) at 1.65-2.05 Mb position. The 
dominant alleles of causal mutations in the genes 
harbouring the POLL locus cause the polledness in 
cattle [15,17,20,47,52,53]. 

Double muscle in cattle 

Bovine Myostatin (MSTN) i.e. growth and differentiation 
factor 8 {GDF8) gene (BTA-2: 6213566 - 6220196 bp) 
harbours various alike-in-state mutations in its third exon 
that underlie the muscular hypertrophy (a partially reces- 
sive trait) in some beef cattle breeds. For example, the 
double muscles are linked to the loss-of-function substitu- 
tion in Piedmontese (and rarely in other beef breeds) and 
a frame-shifting 11 nucleotide deletion in Belgian Blue, 
South Devon and Asturiana de los Valles [20,39,54,55]. 

Polledness in sheep 

Relaxin/insulin-like family peptide receptor 2 {RXFP2) gene 
on ovine autosome 10 (OAR-10: 29491481 - 29538132 bp) 
is located in a known selected genomic region linked to 
the horn morphology in sheep [11,56,57]. 

Double muscle in sheep 

Ovine M^JTVgene on OAR-2 (126318371 - 126323354 bp) 
harbours a single loss-of-function mutation in its 
3 '-untranslated region (strongly selected in Texel) 
that inhibits its translation resulting the double muscle 
in sheep [11,55,58]. 

In addition, for dataset E, cattle breeds of European 
(46 breeds, 847 animals) and African (7 breeds, 226 ani- 
mals) origin were compared (Table 1). There were sev- 
eral cattle breeds of small sample size {n < 20) in the 
European group. Therefore, the effect of sample size on 
the computation of our composite and constituent selec- 
tion tests was also assessed by comparing results from 
analyses by excluding and including the breeds with 
small sample size {n < 10 and n < 20). 
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Table 1 Breeds, samples, genotypes (SNPs) and known genes In each group of cattle and sheep 


Species 


Trait 


Groups 


Breeds 


Animals 


Genome 


SNPs 


SNP density 


Derived 


Known 


Dataset 








{nf 


in) 


assembly 




(kb) 


SNPs (n) 


genes 


code 




Polledness 


Poll head 
Horn head 


7 
7 


85 
127 


UMD3.1 


38,290 


65.50 


38,177 


POLL locus 


A 


Cattle 




Double 


3 


49 
















DoubiB 
muscling 


muscling 

Normal 
muscling 


14 


308 


UMD3.1 


38,520 


65.15 


38,407 


MSTN 


B 




Polledness 


Poll head 
Horn head 


37 
36 


1489 
1290 


OARvl.O 


47,498 


51.26 




RXFP2 


C 


Sheep 


Double 
muscling 


Double 
muscling 

Normal 
muscling 


3 
71 


149 
2654 


OARvl.O 


47,502 


51.26 




MSTN 


D 


Cattle 


Geographic 
location 


African 
European 


7 

46 


226 
847 


UMD3.1 


37,905 


65.67 


37,795 




E 



'Details of breeds and genotyping information about cattle and sheep is available in the (Additional files 1, 2 and 3: Table SI, S2 and S3, respectively). 



Test statistics for selection signatures 

The signatures of recent positive selection are expressed 
as a localized increase in allelic frequency of the benefi- 
cial mutations towards fixation in the population. Non- 
ancestral alleles at mutated loci are called "derived" 
alleles and usually, the function-altering derived alleles 
create the phenotypic diversity. The excess of recently 
selected beneficial (ancestral or derived) alleles results in 
a 'hitchhiking' of neighbouring polymorphisms which re- 
sults in extended haplotype homozygosity in the region 
of selection [22]. We selected three single test statistics 
which capture the increase in highly differentiated loci 
(Fst), or increase in derived allele frequency (ADAF and 
ASAF), or the increase in haplotype homozygosity (XP- 
EHH) along the genome in each of the five datasets. A 
brief implementation of each test statistic is described 
below. The new method, which we term as composite 
selection signal (CSS), combines the three estimates of 
the single selection tests in a single index. 

FsT 

The fixation index (Fst) of population differentiation is 
estimated from the deviation in allele frequency between 
populations compared against the within population 
polymorphic frequency [26]. It can detect selection sig- 
natures using genetic polymorphism data by a pairwise 
comparison between two contemporary populations. 
SNP-specific Fst values were computed for each pair of 
phenotypically contrasting groups within all the sets of 
cattle and sheep data using a custom R script available 
upon request. Extreme positive values of Fst for the par- 
ticular locus are indicative of high levels of reproductive 



isolation of the two populations and divergent selection 
in both or strong positive selection in one of the popula- 
tions and/or random drift. 



Highly differentiated SNPs with an excess of new muta- 
tions (derived alleles) can be identified by the distribu- 
tion of derived allele frequency (DAF). Change in the 
DAF (ADAF) was calculated as the difference of DAF in 
the putative selected population or group (Ds) and the 
DAF in the alternative non-selected populations or 
groups (Dns)) where ADAF = Ds - D^s as given in 
Grossman et al. [23]. ADAF scores have an approximate 
normal distribution. We standardized ADAF to have a 
zero mean and unit variance to identify the outlier SNPs. 
The use of the ADAF statistic was restricted to cattle data 
where the derived and ancestral allele could be inferred 
unambiguously. In sheep, no such out-group was avail- 
able; hence, the ancestral allele could not be inferred. 



To accommodate the lack of information on ancestral 
allele in sheep, we developed a simple statistic based on 
the allele frequency differences between the populations. 
Based on the observed allele frequency distributions, we 
calculated the directional change in the selected allele 
frequency (ASAF) across two populations / and so that 
ASAF =j^,-j^., where, f^, is the frequency of allele A, 
the major allele in the putatively selected population /; 
similarly, is the frequency of allele A in non-selected 
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population ASAF scores were also standardized to Z ~ 
A^(0,1). Since the estimates of ADAF and ASAF are a 
function of the allele frequency distributions, a signifi- 
cant association is expected for loci under strong selec- 
tion and can be used alternatively depending on the 
availability of required information about derived and 
ancestral alleles. Comparison between ADAF and ASAF 
to validate the latter using the cattle data has shown a 
very strong correlation (r > 0.8) for the SNP scores at 
candidate gene regions and genome-wide. Replacement 
of ADAF by ASAF as input in CSS has shown no appre- 
ciable difference in the results for the control regions of 
cattle (data not shown). 



XP-EHH 

A multi-allelic (haplotype based) test has many advan- 
tages in studying genome-wide patterns of divergence 
over single locus (SNP) analyses, since the latter may be 
less informative due to ascertainment bias in the SNP 
discovery process [59]. Long-range haplotype (LRH) 
tests can detect the signals of positive selection by find- 
ing common alleles carried on unusually long haplo- 
types. Due to LD, selection pressure on a beneficial 
allele at a polymorphic locus can also affect the neigh- 
bouring neutral loci, resulting in long haplotypes of low 
diversity across extended regions [60]. Extended haplo- 
type homozygosity (EHH) detects selection signatures by 
comparing a base (core) haplotype, characterized by high 
frequency and extended homozygosity, with other haplo- 
types at the selected locus. EHH is the probability that 
two randomly selected chromosomes carrying the candi- 
date core haplotype are homozygous for the entire inter- 
val spanning the target region for a given locus. The 
EHH statistic depends on the allele frequency and the 
strength of LD with neighbouring loci; hence, it is 
applicable to an incomplete selective sweep when the 
selected allele becomes very frequent but is not yet fixed 
within a given population. EHH is less robust in a 
situation where the selected alleles may have reached 
fixation and their alternative alleles have disappeared 
in a population i.e., a complete selective sweep [43]. 
Complete selective sweeps can be dealt with using the 
across population EHH (XP-EHH) test, which compares 
each population (breed) with the other population(s) on 
corresponding haplotypes. XP-EHH has high power to 
detect selection signatures in small sample sizes and 
power may be gained by the grouping of genetically 
similar breeds [22,23,29,43]. We calculated the XP-EHH 
for each of the five datasets using the procedure de- 
scribed by Sabeti et al. [22]. Further, XP-EHH scores 
were standardized in each analysis so that a genome- 
wide distribution of all scores has zero mean and unit 
variance. 



Composite Selection Signals (CSS) 

Three selection tests (Fst, XP-EHH, ADAF or ASAF) 
were combined with the hypothesis that a common sig- 
nal across the multiple test statistics would be detected 
as an extreme CSS score at the trait specific genomic 
positions. The following outlines the method used to 
compute CSS scores from combining the three compo- 
nent test statistics for the same SNP, as well as deter- 
mining /^-values for these composite tests, to test for the 
existence of a common signal. 

Let Tij be the test statistic using method /, (/ = 1, m) 
calculated at SNP (/= 1, n). Then for each test stat- 
istic type /, obtain the rank of each observed test statistic 
across all n SNPs, say Rij = mnk{Tij), which takes values 
1, Next, these ranks are converted to fractional ranks 
by re-scaling them to lie between 0 and 1, i.e. R'ij = Rij/ 
{n + 1), giving values from l/{n + 1) through n/{n + 1). 

Note that the fractional rank does not use the magni- 
tudes of the actual test statistics: this makes it inherently 
robust, as in any other nonparametric procedures that 
are based on ranks. However, there is therefore some 
loss of information. Some of this information may be re- 
covered by converting the fractional ranks to :2-statistics, 
Zij = <^'^{R\j), where 0"^() is the inverse cumulative 
distribution function (CDF) for a standard normal, i.e. 
maps values 0 through 1 to an underlying standard nor- 
mal distribution, Z~A^(0,1). Once converted to normal 
scores, the average values were calculated at each SNP 
position, Zy, 7 = 1, n, and j?- values were directly ob- 
tained from the distribution of means from a nor- 
mal distribution, Z - A/'(0, m"^) , i.e. p = l-o(^ni^^Zj^ 

where 0( ) is the CDF for a standard normal distribution. 

The log-transformed j?-values {-logiop) corresponding 
to the set of mean Z values (Zy) were declared as the 
composite selection signals (CSS) and these were plotted 
against the genomic positions to identify the significant 
selection signals. If there is a common signal across the 
multiple test statistics, this will show up as an excess in 
CSS at that point, otherwise, CSS may be dampened 
down, i.e., regressed to the genome-wide average. 



Significant SNPs under selection 

The results from five datasets (Table 1) were compared 
across three constituent tests and CSS. In the absence of 
a known probability distribution for most cases of the 
test statistics used in this study, SNPs with extreme test 
scores (top 0.1%) in the genome- wide distribution were 
considered significant [11]. Selected variants tend to 
impose the selection pressure on neighbouring alleles 
because of hitchhiking; therefore, significant signals are 
expected to cluster together. Hence, in order to 
minimize the spurious noise from single SNP tests with 
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resultant false positives, the test statistics were averaged 
(smoothed) over SNPs within 1 Mb sliding windows 
centred at each SNP along the chromosomes. 

Genomic regions and genes under selection 

Clusters consisting of a multiple SNPs with the extreme 
CSS test statistics (top 0.1%) spanning 1 Mb windows 
around the SNP with most extreme value were selected. 
This was termed as a significant cluster by each test and 
its boundaries were defined by the first and last SNP. 
Consecutive clusters spaced less than 1 Mb apart were 
merged into a single cluster. Further, for mining candidate 
genes, we define the genomic regions underlying the sig- 
nificant clusters by including an additional 0.5 Mb on each 
side, considering genome-wide uniform LD patterns. 

For comparison across multiple tests, we identify the 
genomic region by each test and count the numbers of 
significant SNP scores in other selection tests within 
each region. For example, at the first step, regions were 
defined by CSS and significant SNPs were counted in 
XP-EHH, FsT and ADAF (or ASAF). 

The significant genomic regions were investigated for 
genes that mapped on the respective genome assembly 
of both species for the candidate traits. For the genes in 
non-candidate regions identified by CSS, we further in- 
vestigated the respective subpopulation for any add- 
itional phenotypes that might have been under positive 
selection. Similarly, genes underlying the significant gen- 
omic regions in geographic population groups of cattle 
were also investigated to understand the historic and 
commercial imprints of selection. 

False discovery rate 

The control of false positive signals in multiple hypoth- 
eses testing is essential in genomic studies. The false dis- 
covery rate (FDR) is considered a reliable statistical 
method for correction in case of multiple comparisons. 
The estimation of FDR is influenced by the accuracy 
of the P'Vdlue estimations and the validity of their 
underlying distributional assumptions. Correctly estimated 
/^-values from the null hypothesis are assumed to exhibit a 
uniform distribution. Usually, on the other hand, observed 
distribution of ^-values from multiple tests consists of a 
mixture of distributions of /^-values from true null hypoth- 
eses along with true alternative hypotheses. To improve 
the accuracy of FDR estimation, empirical j^-values from 
non-smooth CSS were calibrated using the constrained 
regression recalibration (ConReg-R) method so that the 
observed j?-values have the properties of an ideal empirical 
P'VBlue distribution [61]. The taU area based FDR 
(<5^-values) were estimated from the calibrated /^-values 
using the R package "fdrtool" [62] with its default options 
for "st2itistic =P'Value\ when it uses the empirical data 



below the 75th percentile to determine the null distribu- 
tion of the test statistics. 

FDR were computed against the calibrated j^-values 
for the raw CSS scores of each validation dataset ana- 
lysis. Within the significant region boundaries, the per- 
centages of SNPs having FDR < 5% were calculated. To 
differentiate the distribution of true null and true alter- 
nate hypotheses, we compared the density distribution 
of FDR (<5^-values) of SNPs within significant regions 
against the rest of genome-wide SNPs. 

Results 

Identification of significant loci 

The map of chromosomes containing highest empirical 
CSS scores within each trait-wise dataset (A to D) is pre- 
sented in Figure 1. Genome-wide comparisons of empirical 
distributions of all the selection tests across the four valid- 
ation datasets are shown in (Additional file 4: Figure SI), 
(Additional file 5: Figure S2), (Additional file 6: Figures S3) 
and (Additional file 7: Figure S4). A strategy of smoothing 
SNP-wise empirical statistics was applied to three compo- 
nent selection tests and composite selection signals: for 
each case, the mean number of SNPs in genome-wide 
1 Mb windows was 17 and 19 SNPs in cattle and sheep 
data, respectively (Additional file 8: Figure S5). The win- 
dows containing fewer than 5 SNPs were discarded from 
further analysis. After pruning such low SNP density win- 
dows, 38,211 (dataset A) and 38,441 (dataset B) sliding 
windows were retained for polled and double muscle cat- 
tle, respectively. Similarly, 47,438 (dataset C) and 47,442 
(dataset D) sliding windows of averaged (smooth) test 
statistics were used from the polled and double muscle 
sheep analyses, respectively. 

Genome-wide low to moderate correlations among the 
pairs of three single tests suggest a partial concordance 
among these tests; whereas, CSS has a high correlation 
with its all component tests, which suggests capture of 
information across multiple tests (Additional file 9: 
Figure S6). The genome-wide map of empirical scores 
(non-smoothed) and smoothed scores indicates a number 
of genomic regions with clusters of SNPs with high scores 
in each of the four analyses. 

The magnitude of smoothed CSS in the significant 
clusters was affected by the SNP density and extent of 
LD between the SNPs within the sliding window. For 
example, the POLL locus is located on the proximal end 
(rich crossing over region) of BTA-1 where the high 
recombination rate reduces the LD among neighbouring 
SNPs (Table 1). In dataset A, highly significant raw CSS 
scores were located in the candidate gene region on 
BTA-1 (Figure 1-A), whereas existence of strong LD 
(see Discussion) on BTA-14 has lifted this region to the 
top of the smoothed distribution as shown in the 
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Figure 1 Composite selection signals (CSS) for validation datasets. Chromosome-wise plots of highest CSS scores are shown for trait-wise 
datasets of cattle (A and B) and sheep (C and D). The dotted red horizontal lines in the CSS plots indicate the genome-wide 0.1% thresholds of 
the empirical scores. Smooth lines are the smoothed CSS scores by averaging SNPs within each 1 Mb window. Vertical green lines indicate the 
location of candidate genes at each chromosome as follows: A = POLL locus for polledness in cattle (dataset A), B = MSTN for double muscle in 
cattle (dataset B), C = RXFP2 for polledness in sheep (dataset C), and D = MSTN for double muscle in sheep (dataset D). 



genome-wide distribution in (Additional file 4: Figure 
Sl-A). In datasets B, C and D, in contrast to dataset A, 
the magnitude of raw as well as smoothed CSS scores 
remained on top in the genome-wide distribution be- 
cause their candidate regions were localized in cold- 
spots of less frequent recombination (Additional files 5, 
6, 7: Figure S2 to S4). 

Significant genomic regions under selection in validation 
datasets 

Of the genome-wide smoothed test statistics, the top 39 
and 48 SNPs (i.e. top 0.1%) in the cattle and sheep data- 
sets, respectively were used to find significant regions 
under selection. A number of selection signals were 
found in each dataset by all the test statistics. Overall, 9, 
12, 10 and 5 genomic regions were detected in datasets 
A, B, C and D, respectively (Additional file 10: Table S4). 
These multiple significant regions were the result of low 
concordance between the component tests and their 
power to capture slightly different characteristics of the 
selective sweep. Note that across the four datasets, 15, 



15 and 21 genomic clusters were captured by XP-EHH, 
FsT and ADAF/ASAF, out of which 4, 5 and 13 regions 
were specific to individual tests. These 36 regions were 
narrowed down to 12 significant regions with the CSS 
approach (Table 2). 

Regions identified through CSS were further investi- 
gated to find specific genes associated with positive se- 
lection. A number of genes were found in each region; 
therefore, precise inferences about the specific target of 
selection may be difficult. The results from the compo- 
nent tests suggest a high concordance for significant 
clusters in the candidate regions but also a number of 
additional significant signatures located in genomic re- 
gions of unrelated or unknown genes (Additional file 10: 
Table S4). The concordance between the three distinct 
tests statistics at the four control regions establishes the 
support of CSS for detecting true selection signatures. 
The CSS test has fewer significant clusters and most of 
these are close (where SNPs are missing within genes) or 
harbouring the genes associated with the traits of inter- 
est in all datasets. We briefly describe the genomic 
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Table 2 Genomic regions under selection in cattle and sheep identified using composite selection signals (CSS) 



Region^ 


Chr 


Position'^ 


Number of significant SNPs 


Total 


Known genes'^ 


Gene function 


(Mb) 


CSS 


XPEHH 


FST 


ADAF 


genes'^ 


Al 


1 


1.01-2.63 


10^ 


9 


1 


- 


15 


POLL locus 


Polledness 


A5 


13 


63.90-65.97 


18 


23* 


1 


5 


26 


UOCQ GDF5 


Stature 


A7 


14 


23.78-25.61 


11 


7 


5* 


10* 


12 


PLAGh CHCHD7 


Stature 


B1 


2 


6.15-7.82 


10* 


11* 


3 


- 


9 


MSTN 


Double muscle 


B2 


6 


66.55-68.11 


11 


8 






6 


COX7B2, FRYL 


Reproduction 


B6 


16 


44.49-46.05 


11 


11 


1 




12 


NMNATh PIK3CD, SPSBl, SLC 


Embryonic growtli, immunity 


B8 


18 


13.34-15.03 


5 


3 


1 




33 


MCIR 


Coat colour 


C5 


10 


28.54-30.05 


26* 


17* 


34* 


5* 


9 


RXFP2 


Polledness 


C8 


13 


66.97-68.50 


7 




7 


3 


17 


ASIP 


Coat colour 


CIO 


25 


6.67-8.29 


14 


10 




2 


16 


LRP4 


Bone growth 


D2 


2 


119.62-122.30 


20 


11* 


10 


16 


26 






D4 


2 


124.25-128.05 


28* 


22 


27 


27 


47 


MSTN 


Double muscle 



Cluster of a minimum of three significant SNPs within a window spanning 1 Mb genomic locations centred on a core SNP above the threshold (top 0.1%) in CSS 
(smoothed statistics) are reported and are compared with the constituent tests. 

^Prefix (A, B, C and D) with each region number represents the dataset as defined in Table 1 and rows in bold indicate the genomic regions containing candidate 
genes. A complete list of 36 genomic regions, their positions, range of all significant clusters (for each test) and genes under clusters of significant SNPs is shown 
in [Additional file 10: Table S4]. 

"^Position of genomic regions includes a 0.5 Mb extension on both sides of boundaries of the main cluster identified by CSS to compare constituent tests and 

count of genes (see Methods). Large sized (> 1 Mb) regions are formed by joining successive (<1 Mb apart) clusters. 

'^Genes mapped on bovine (UMD3.1) and ovine (OARvl.O) assemblies within the boundaries of genomic regions. 

'^Candidate genes with known functional/structural effects for a particular trait present in the contrasting panels of multiple breeds. 

*lndicates the cluster of highest ranked SNPs (raw scores) for a particular selection test. 



regions under selection identified from each dataset by 
CSS as follows: 



Signatures of selection in validation datasets 

The genome-wide map of empirical scores (non-smoothed) 
indicates that the highest CSS above the 0.1% threshold 
were in the candidate regions in all of the four analyses 
(Figure 1). At least five significant SNPs for CSS were found 
for each trait within the respective genie regions. The three 
component tests (FsT) ADAF or ASAP, and XP-EHH) were 
found coinciding in the candidate gene regions but with 
fewer and lower ranked SNPs as compared to the CSS test. 

In dataset A, significant CSS scores were found in the 
candidate region (BTA-1) harbouring the POLL locus for 
polledness in cattle (Figure 1-A, and region Al in 
Table 2). Two additional significant clusters were found 
on BTA-13 (region A5: UQCC-GDFS genes) and BTA-14 
(region A7: PLAG1-CHCHD7 genes) (Table 2, Additional 
file 4: Figure SI- A). 

In dataset B, the highest CSS scores were localized at 
BTA-2 flanking MSTN, the gene responsible for double 
muscling in cattle (Figure 1-B, and region Bl in Table 2). 
Additional peaks of significant CSS were located on 
BTA-6 (region B2: COX7B2 gene and near FRYL, PDGFRA 
genes), BTA-16 (region B6: SLC25A33 and SLCC4SA1 
genes) and BTA-18 (region B8: MCIR gene) (Table 2, 
Additional file 5: Figure S2-A). 



In dataset C, the candidate region on OAR- 10 har- 
bouring the RXFP2 gene for polledness in sheep con- 
tained the extreme CSS scores (Figure 1-C, and region 
C5 in Table 2). In addition, OAR-13 (region C8: ASIP 
gene) and OAR-25 (region CIO: LRP4 gene) exhibit the 
one significant peak each (Table 2, Additional file 6: 
Figure S3- A). 

In dataset D, extreme CSS scores were found flanking 
the MSTN gene for double muscling in sheep on OAR-2 
(Figure 1-D, and region D4 in Table 2). Notably, both 
significant peaks are on OAR-2 and are in the candidate 
region or in LD with the candidate gene region spanning 
an 18 Mb region (Table 2, Additional file 7: Figure S4-A). 

The non-candidate regions in datasets A, B and C, 
contain genes that have been previously linked to vari- 
ous phenotypes in several species. Some of these genes 
were associated with phenotypes within our subpopula- 
tions (see Discussion). Overall, presence of the signifi- 
cant clusters of extreme CSS scores in the candidate 
regions of the cattle and sheep cohorts indicates im- 
proved power of CSS as compared to the constituent 
individual tests. 

False discovery rate (FDR) 

While the distribution of j^-values for regions without 
evidence of selection is not uniform (Additional file 11: 
Figure S7), there is nonetheless a clear spike' in the 
frequency of very small j?- values, lending support for 



Randhawa et al. BMC Genetics 2014, 15:34 
http://www.bionnedcentral.conn/1471-21 56/1 5/34 



Page 9 of 1 9 



evidence of selection signatures. Nonetheless, genome- 
wide ^-values were calculated for the calibrated /^-values 
to estimate the FDR for each analysis (Additional file 11: 
Figure S7 and Additional file 12: Figure S8). Overall, the 
top 0.1% of SNPs based on raw CSS scores of the four 
datasets has considerably low FDR {q < 0.0001). Figure 2 
shows a clear distinction between the density distribu- 
tions of ^-values for the SNPs in identified regions and 
SNPs in the rest of the genome for each dataset. Table 3 
further shows that the identified genomic regions have a 
much higher proportion of SNPs with low <5^-values sug- 
gesting strong evidence for selection signals in the data. 
These proportions in the control regions in cattle are 
85.7% (regions Al) and 90% (region Bl) as compared to 
genome-wide proportions of 9.8% (dataset A) and 6.2% 
(dataset B), respectively. In sheep, 46.2% and 75.9% of 
total SNPs have q < 0.05 in candidate regions C5 and D4 
as compared to much lower values of 5.3% and 2.4% for 
datasets C and D, respectively for their neutral regions. 
Similarly, in all the non-candidate regions in the four 
datasets, the percentage of SNPs with q < 0.05 is signifi- 
cantly higher as compared to the rest of the genome 
(Table 3). 

Signatures of selection in geographically isolated multi- 
breed populations of cattle 

Finally, in dataset E, the smoothed scores from 37,827 
sliding windows (after removing windows containing < 5 
SNPs) were plotted along the genome in order to inves- 
tigate the regions under selection. Figure 3 shows the 
Manhattan plots of smoothed CSS scores for European 
and African groups of Bos taurus cattle; complete list of 
significant genomic regions and underlying genes in 
both groups are listed in (Additional file 13: Table S5). 
The comparison of each test by including and excluding 
breeds with less than 10 and 20 animals showed negli- 
gible differences for the effect of variable sample size 
(especially low) of breeds in European group (Additional 
file 14: Figure S9). It shows that breeds with a similar 
history generally have shared patterns of genetic diver- 
sity. In addition, it also provides evidence that computa- 
tion of CSS is not sensitive to the individual sample size 
of the participating breeds for outbred populations. 

We note that, overall, CSS method identified clear 
peaks of higher magnitudes in European group as com- 
pared to the African cattle (Figure 3). The differences in 
the historical and recent selection pressures can result in 
genome-wide excess of rare, potentially derived, alleles 
within a population as compared to a reference neutral 
population. It was further evident from the genome- 
wide average DAF (MAF) values of 0.38 (0.26) and 0.32 
(0.20), respectively showing that European and African 
cattle have experienced variable selection pressures. A 
comparison of chromosome-wise average of DAF and 



MAF shows a consistently higher selection in European 
group (Additional file 15: Figure SIO). Hence, we further 
investigated the significant genomic regions of European 
cattle for their underlying genes in relation to their 
unique phenotypes. Significant genomic regions were 
identified on BTA-1, BTA-13, BTA-14 and BTA-16 by 
CSS (Figure 3- A). These regions have been generally 
supported by the constituent selection tests and they 
contain genes of known functional role in several traits 
of economic importance in European cattle (Figure 4). 
However, additional genomic regions identified individu- 
ally by each of the constituent tests - other than com- 
mon with CSS - did not capture any known genes as 
candidates of selection signatures (Additional file 13: 
Table S5). 

Discussion 

This study illustrated a new approach, the CSS, for dis- 
covery of selection signatures in outbred populations, 
which combines three commonly used test statistics into 
a single index. As expected, each of individual tests (Fst> 
XP-EHH, ADAF/ASAF) can distinguish selection from 
neutrality but targets slightly different characteristics in 
the genetic polymorphism data that has been shaped by 
the selection. Hence, there was only partial agreement in 
the signals of selection (signatures) in these single tests 
at the candidate loci. Individually, the three tests also 
identified additional unique significant clusters with no 
known candidate genes that indicate their lack of sensitivity 
to localize real selection signature and high false selection 
signals (Additional file 10: Table S4). Many earlier studies 
in cattle and sheep reported selection signatures detected 
based on these individual tests [11-13,18,32,39-42,64]. 

The strength of CSS is to combine the component 
signals so that strongly selected regions harbouring a 
common signal across the constituent test statistics can 
be identified. The complementary signals from con- 
stituent statistics resulted in increased magnitude of 
CSS at target loci. For example, in dataset D, the high- 
est CSS cluster was found at the candidate gene region 
whereas, XP-EHH localized 5 Mb upstream and Fst and 
ASAF localized their top ranked signals at 4 Mb down- 
stream of this target region (Additional file 7: Figure S4). 
Overall, our results suggest that the CSS successfully lo- 
calized candidate gene regions in both species and both 
traits, thus providing a validation for this method (Figure 1, 
Table 2). 

Signatures of selection in traits specific groups of cattle 
and sheep 
Polled cattle 

A cluster of significant SNPs was successfully localized 
in the candidate region Al on BTA-1 that flanks the 
functional mutations in the POLL locus for poUedness. 
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(gray). Density plots are shown for polled cattle (A), double muscle cattle (B), polled sheep (C) and double muscle sheep (D). Vertical dashed ( 
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lines indicate Q-values (FDR) = 0.05 in each subset. Q-values were calculated from the calibrated p-values. Histograms of the mean Z, empirical and calibrated 
p-values are shown in Additional file 7: Figure S4. Relationship between Q-values and calibrated p-values is shown in Additional file 8: Figure S5. 



In addition, there were two significant clusters on BTA-13 
and BTA-14. We further investigated dataset A for any 
additional structure within the subpopulation of selected 
cattle breeds. In fact, besides the polledness and horn 
classification, there were also differences in the body 
size (stature) between the two groups. The polled group 



(Angus, Belted Galloway, Galloway, Murray Grey, Red 
Angus, Red Poll and Romosinuano) contains breeds of 
small to medium body size; whereas, in the horned 
group all of the breeds were of medium to large size, 
except Scottish Highland (7% of the horned group sam- 
ples) which is a small body size breed (Additional file 1: 



Table 3 False discovery rates within identified genomic regions in each validation dataset of cattle and sheep 



Region in 
Table 2 


Chromosome 


Total SNPs 


SNPs in region q<0.05 

% 


SNPs outside regions q<0.05 
% (in dataset) 


Al 


1 


14 


85.7 




A5 


13 


19 


78.9 


9.8 (A) 


A7 


14 


11 


81.8 




Bl 


2 


10 


90.0 




B2 


6 


11 


63.6 












6.2 (B) 


B6 


16 


11 


36.4 




B8 


18 


12 


41.7 




C5 


10 


26 


46.2 




C8 


13 


9 


44.4 


5.3 (C) 


CIO 


25 


15 


60.0 




D2 
D4 


2 
2 


23 
54 


87.0 
75.9 


2.4 (D) 



^Total number of SNPs located within the boundaries of the main cluster identified by CSS and their position exclude 0.5 Mb additions for gene investigation 
(shown in Table 2). 
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Figure 3 Composite selection signals (CSS) for geographically isolated cattle populations. Manhattan plots of -logio(p) of CSS are shown 
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stars are shown for raw CSS scores in the genome-wide background and bold at the putative selection regions underlying the significant clusters. 



Table SI). Indeed, the significant cluster on BTA-13 is 
located at the 66 Mb (region A5) which harbours a pair 
of genes {UQCC-GDFS) that has been significantly associ- 
ated with variation in human height [65-67] and body 
measurement traits in cattle [68]. In European and East 
Asian human populations, strong signals of recent selec- 
tion have also been identified near the GDF5 gene [60]. 
Similarly, the second most significant additional cluster on 
BTA-14 (region A7) in dataset A harbours the PLAGl and 
CHCHD7 genes which have been mapped for stature in 
cattle and human [17,19,20,48,67,69,70]. 

Double muscle cattle 

The highest CSS scores were found in the candidate 
regions on BTA-2 (region Bl) which harbours the 
functional mutations in MSTN gene for double musc- 
ling in cattle. In dataset B, several genes of interest 
were found in four additional clusters at regions B2, 
B6, and B8. Region B2 contains the FRYL gene within 
the peak at 68.0 Mb position on BTA-6. Significant se- 
lection signatures have previously been detected in 
this region and its flanking gene PDGFRA. This gene 
has been found connected to multiple molecular net- 
works involving |3-estradiol and is associated with 
reproduction in cattle [13]. Region B6 harbours solute 
carrier family genes, SLC25A33 and SLCC45A1 cover- 
ing the 45.0-46.0 Mb position on BTA-16. This region 
was reported as carrying highly differentiated loci and 



extended haplotype homozygosity in multiple breeds 
[71]. Region B8 contains the MCIR gene near the peak 
at 14.0-15.0 Mb position on BTA-18, where strong selec- 
tion signatures have previously been identified involving 
several breeds that have also been used in the present 
study [15]. The melanocortin 1 receptor {MCIR) gene is 
the candidate for coat colour in cattle [13,15,20,72,73]. 

Polled sheep 

In polled sheep, the regions with highest CSS scores 
were on OAR- 10 (region C5) near the RXFP2 gene for 
polledness. In addition, OAR- 13 (region C8) and OAR- 
25 (region CIO) exhibit the two significant peaks at posi- 
tions 68.0 Mb and 8.0 Mb, respectively. At the peak on 
OAR- 13 the footprints of selection have previously been 
reported for the ASIP gene [11] which controls black 
and white coat colour in sheep [74]. Selection signatures 
have also been reported for the cluster on OAR-25 
(region CIO) but the gene(s) and cause of selection were 
unknown Kijas et al. [11]. However, the low density 
lipoprotein-related protein 4 {LRP4) gene, located near 
this region (CIO), controls the inhibitory function on 
bone growth in human [75], hence, it may have some 
role in horn formation or it may play some role in the 
body size by controlling the body bone mass in sheep. 
Furthermore, this region contains a putative major gene/ 
QTL for wool quality and fibre diameter across a range 
of breeds [76,77]. 
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Double muscle sheep 

The genomic region with highest CSS scores was found 
on OAR-2 (region D4) near the MSTN gene for double 
muscling in Texel sheep breeds. In dataset D, all the 
additional significant peaks are also on OAR-2 and are 
in the candidate region or in LD with the candidate gene 
location spanning a region of almost 10 Mb. These re- 
sults suggest that a very strong selection pressure would 
identify a broad genomic region of selection signature, 
which may limit the power of fine-mapping and identif)^- 
ing the causal mutation in such a resource population. 
However, the complementary signals from constituent 
statistics at the target gene notably improved the magni- 
tude of CSS. 



Clearly across all traits in both species, composition of 
the breed panel in which selection signatures are to be 
detected may give rise to associations with more than 
one trait, i.e. confounding, which could give rise to spuri- 
ous signals open to misinterpretation. Hence, independent 
validation or within-breed linkage studies may be required 
for further confirmation of such selection signatures. 

Signatures of selection in geographically isolated 
Bos taurus 

Strong selection signatures for several economically im- 
portant traits in European cattle were identified at five 
genomic regions (Figure 3-A, Figure 4). At the proximal 
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end of BTA-1, the POLL locus [53] was identified for the 
whole group. While the polled phenotypes is not com- 
mon in all the European breeds, due to the economics 
of dehorning and increased demand for animal safety ac- 
quired in the natural poUedness, the POLL locus is being 
introgressed in most of the commercial cattle breeds. 
Moreover, our results can also be explained as the com- 
mon haplotypes at the POLL locus found to be shared in 
several polled and horned breeds [47]. 

On BTA-13, CDC123 and CAMKID genes have been 
reported to participate in various functions of pancreatic 
beta-cell and genetic variants at this locus have been as- 
sociated to type 2 diabetes susceptibility in human [78]. 
The associated gene pair has been known for its role in 
insulin-related metabolic traits [79]. In European cattle, 
these genes may be involved in several metabolic path- 
ways for sufficient availability of energy for improved 
growth, production and maintaining body temperature 
in a temperate environment. 

The prominent selection signal at the proximal end of 
BTA-14 underlies the DGATl gene that has been re- 
ported to have a significant role in several traits of dairy 
and beef production [14,28,80-84]. At another location 
on BTA-14, the stature (PLAGl-CHCHD?) genes (as 
discussed above) were also identified for the European 
group. The region on BTA-14 has been found to have a 
significant enrichment for the runs of homozygosity 
(ROH) - an indicator of strong LD - in the majority of 
cattle breed types (beef, dairy, English, European etc.) 
using the SNP data from the 50K and 800K BovineSNP 
chip assays [46]. Cattle have extensive LD patterns com- 
pared to human [85]. LD is likely to be even more exten- 
sive in the vicinity of a selective sweep and hence the 
frequency of selected alleles in these regions is likely to 
be high, being driven towards fixation [86]. 

The extreme peak in CSS for European cattle was 
found near the centre of BTA-16. The existence of a 
huge CSS can be explained such that several genes at 
this region have been reported as the candidate of 
strong selection. Signatures of selection have identified 
SLC25A33 and SLCC45A1 genes for their important role 
in immunity related to tropical adaptation [71]. Simi- 
larly, PIK3CD and SPSBl genes were also identified 
under selection respectively linked to immune response 
and immune regulation in both Angus and Simmental 
[17]. Signatures of selection have also been found in sev- 
eral breeds for NMNATl [32] and RERE [17] genes 
which have been associated with embryonic growth and 
reproductive development. At the same location, the 
KIFIB gene was identified under strong selection in 
Holstein cattle [13]. In addition, another gene, AGTRAP, 
which is located at 1 Mb upstream to the CSS peak, has 
been identified for dairy production due to its role in 
mammary glands [15]. 



Overall, in European cattle, we note that the mag- 
nitude of CSS scores corresponded to the diversified 
and extensive role of underlying candidate genes. 
This provides further evidence for CSS to capture 
trait-specific genomic regions as illustrated in valid- 
ation datasets. 

In the African Bos taurus, in general, the lack of pro- 
nounced selective pressures as compared to the European 
counterparts has resulted in localizing the significant CSS 
in non-genic regions or regions harbouring genes of 
unknown effects (Additional file 13: Table S5). Additional 
limiting factors such as SNP ascertainment bias and high 
admixture in African taurine due to excessive crossbreeding 
with African indicine cattle could also have contributed to 
the randomly dispersed signatures of selection. 

The effects of SNP origin (ascertainment bias) on vari- 
ous estimators of population genetic parameters and 
some practical methods for correcting them have been 
discussed elsewhere [87-90]. In cattle, the SNP panel 
(50K BovineSNP chip) was designed predominately based 
on the genetic polymorphisms in European breeds 
[3,17,52] which resulted in low representation of rare 
variants, thus a lower SNP diversity within some 
non-European cattle breeds, especially in Bos indicus 
or African Bos taurus. The SNP ascertainment bias 
has been found to have profound effects in the com- 
bined analysis of worldwide breeds [3,10,38]. The sig- 
nificant SNPs clustering in composite tests partly 
depend on the haplotype-based component tests, es- 
pecially those that are derived from EHH [16,23,38], 
which can be significantly affected by breeds used to dis- 
cover SNPs [91]. 

We adopted a cautious approach by excluding the 
indicine breeds from the available genotypic data of the 
African cattle [10,38,49] to minimize diversity within the 
African cattle group in dataset E. Generally, morpho- 
logical and genetic data suggest a common origin for 
African and European taurines [39]. That can be suitable 
in analyses of multi-breed group comparison, i.e., assum- 
ing that both populations are closely related, while 
differentially selected at a few genomic regions. Never- 
theless, phylogenetic investigations have shown early 
divergence between the African and European taurine 
cattle and high genetic relatedness between the African 
taurine and indicine cattle [3,10]. Indicine allelic enrich- 
ment in the African taurines Y-chromosome [92] and 
autosomal SNPs has also suggested a high genetic ad- 
mixture in several African breeds [10], that could have 
swept out several taurine-specific genomic regions. Hence, 
genome-wide high heterogeneity within the African co- 
hort could not help resolve the mapping signature of 
selection for various candidates of selection; for example, 
climatic adaptation and resistance to various pathogens in 
African 5o5 taurus [49,64]. 
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Comprehensive discovery analyses performed within 
the African breeds are more likely to capture genomic 
regions that have been targets of selection in those 
breeds, as the genome-wide scans for selection signa- 
tures comparing relatively close populations are least 
confounded by common biases [93]. Moreover, add- 
itional accuracy is also expected by using high-density 
SNP panels such as the BovineHD SNPchip (800K SNPs) 
that has been designed to be less sensitive to the ascer- 
tainment bias for non-European cattle breeds [94]. 

Overall, despite the several confounding factors that 
may have limited the localization of previously known 
genes in the African cattle, the existence of significant 
CSS in functionally unknown genes and noncoding 
regions indicate putative regions under selection. In 
several species, the non-coding DNA sequences have 
been predicted for their multiple roles in structural and 
regulatory mechanism of chromosomes including DNA 
replication, epigenomic modifications, regulation of tran- 
scription and translation. With the knowledge of incom- 
plete genomic annotations and several genes of unknown 
functions in cattle, the functional importance of noncod- 
ing regions under evolutionary selective pressure cannot 
be underestimated. Functional annotation approaches and 
resources [95] such as, across species comparison for 
conserved DNA sequences may help further elucidate the 
uncharacterized selective sweeps of our results. 

FDR and power of CSS 

Considering all the individual regions as independent 
events of selection in the genome, all the identified gen- 
omic regions in validation datasets had a much higher 
proportion of significant SNPs {q < 0.05) as compared to 
the rest of the genome in each dataset (Figure 2, Table 3). 
Overall, combining multiple test statistics reduced the 
false signals of CSS as compared to individual constitu- 
ent tests (Additional file 10: Table S4). The strategy of 
grouping the phenotypically alike populations applied in 
the present study could have further reinforced the se- 
lection signals at the common traits candidate regions 
while neutralizing the population specific patterns of 
diversity elsewhere [15]. 

The power (sensitivity) of the individual methods to 
discriminate between true positives (due to directional 
selection) and false positives (due to the forces other 
than selection) is a critical factor in the choice of selec- 
tion tests. A combination of multiple statistics is ex- 
pected to improve the power of composite statistics by 
complementing the detectability of positive selection by 
individual tests [96], e.g., the haplotype-based tests may 
be affected due to the distribution of recombination hot- 
spots across the genome [31]. Haplotypes, on the other 
hand, being patterns of multiple SNPs, are less sensitive 
to ascertainment schemes of the genome-wide panels of 



SNPs. SNP-based tests localize in unknown and non-genic 
regions more frequently and are less specific as compared 
to haplotype estimates as shown in (Additional file 10: 
Table S4) and Qanbari et al. [32]. CSS combines multiple 
characteristics of the genetic diversity from the single 
locus polymorphisms and haplotype patterns which makes 
it less sensitive to the confounding effects of demography 
and recombination [97,98]. However, CSS being a com- 
posite of SNP and haplotype-based test statistic, can still 
be sensitive to SNP density, SNP ascertainment bias and 
the extent and variation of LD across the genome. 

The power of most studies of genome-wide selection 
scans is low because of the small sample sizes, SNP density, 
SNP ascertainment scheme and the test statistics used. 
Panels of outbred populations consisting of multiple breeds 
can be used to increase the sample size and to enhance the 
power of CSS. A large number of samples genotyped with 
various SNP panels are becoming available in many species. 
These data can be combined using imputation strategies 
[99] to increase the power of CSS. 

It is noteworthy to mention that without simulations, 
qualitative evaluation of gain in power of CSS is not 
possible for comparison with the constituent tests. 
Similarly, a direct comparison of CSS with the previ- 
ously published methods of combining multiple statis- 
tics [16,23,31,33,34,100] requires simulation data from 
robust models to depict the underlying dynamics of 
the population of interest along with powerful compu- 
tational tools for permutation iterations [15]. Such 
comparisons are difficult for a real dataset where it is 
almost impossible to subset contrasting populations 
for a single event of selection. However, successful and 
improved localization of candidate genes in cattle and 
sheep, by simply combing rank distributions of con- 
stituent tests, indicates the power of CSS. Moreover, 
CSS can incorporate additional test statistics to add 
power for localizing the selection signature. The choice of 
additional tests to incorporate complementary evidence 
may be based on their unique power under various as- 
sumptions of selection, availability of data information 
(phasing of ancestral and derived alleles) and a priori as- 
sumptions about the dynamics of populations of interest. 
Established selection tests, such as the across-population 
Rsb [30] test and within-population estimates of positive 
selection including integrated haplotype scores (iHS) [60], 
haplotype allelic count statistics called Svd [97] and com- 
posite log likelihood (CLL) [15], can also be included in 
the CSS computation. However, combining too many 
selection tests of similar specificity might bias the CSS 
scores towards those characteristics of related tests. 
This bias may be misleading when interpretations are 
made to generalize the contribution of all the compo- 
nent test statistics. Care is thus required to select 
constituent tests of CSS. 
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CSS is expected to be successful in identifying the can- 
didate genes for complex networks and selection events 
e.g., domestication, adaptation and production traits. 
Complex traits are usually controlled by a very large 
number of loci of small effects; consequently, selective 
pressures on their causal mutations drive a very slow 
change in the allele frequencies. This makes it more 
challenging to discover such genetic variants of small ef- 
fects. Comprehensive phenotypic records and robust trait- 
wise classification are required to efficiently characterize 
complex traits under selection. CSS can be further tuned 
with additional selection tests appropriate to distinguish 
genomic regions under selection for complex traits. To ro- 
bustly map positive selection for complex traits, some spe- 
cialized tests, such as birth date selection mapping [101] 
designed to identif)^ small changes in the allele frequency 
due to selection of polygenic traits can be appropriate. 
The biological functions underlying polygenic inheritance 
are controlled by the interactions between large networks 
of genes. Selective pressures depend on the degree of con- 
tribution and the position of genes in the network [102]. 
The evolutionary properties of the complex traits can be 
captured by exploring gene networks for the genes under 
the selective sweeps. Overall, using CSS along with GWAS 
[34], QTL mapping [100] and approaches including gene 
pathways [103] can elucidate the mechanism underlying 
diversity in complex traits. 

Conclusion 

We developed a method, composite selection signals 
(CSS), which appears to be efficient in identifying selec- 
tion signatures for traits and genes that have evolved 
rapidly under various selection pressures. It is a very ro- 
bust method for detecting selection signatures, as it does 
not depend on any distributional assumptions (normal- 
ity) of the constituent test statistics, and additional test 
statistics can easily be included, if they become available. 
The existence of strong signals linked to known candi- 
date genes, even in the absence of any casual SNP in the 
genotype data, validates the utility of the breed grouping 
strategy and methodology for deriving composite selec- 
tion signals. In addition, estimates of FDR also provide 
clear evidence that any cluster of significant SNPs cap- 
tured by CSS is highly likely to contain a strong candi- 
date (gene or variant) of positive selection. 

The majority of significant peaks outside the candidate 
regions in validation subsets were linked to various add- 
itional phenotypic classifications of cattle and sheep 
cohorts. For example, implementation of CSS identified 
UQCC'GDFS as the plausible candidate genes for stature 
which have known effects on development and skeletal 
growth. Our results also replicated the previously re- 
ported candidate locus containing PLAGl-CHCHD? 
genes for stature in cattle. Other notable secondary 



phenotypes include; coat colour, reproduction, bone 
growth and multiple functioning transporters of the sol- 
ute carrier family of genes. 

In European cattle, the historical impacts of long-term 
selection pressures for economically important traits 
were identified for polledness, adaptation, metabolism, 
growth rate, stature, immunity, reproduction and several 
candidate genes related to dairy and beef production. 

The presence of spurious selection signals is much 
lower in CSS as compared to individual constituent tests 
due to the unique signals of each constituent selection 
test are reduced while combining multiple test statistics. 
Taken together, CSS provides an improvement for the 
predictions of positive selection and demonstrates that 
probing the multiple pieces of evidence for positive se- 
lection can provide important insights into understand- 
ing trait-specific gene evolution. 

Data availability 

R scripts and high quality images are available from the 
corresponding author on request. 

Additional files 



Additional file 1: Table 51. The information about the breeds, animals 
and phenotype categories of cattle samples. 

Additional file 2: Table 52. The information about the breeds, animals 
and phenotype categories of sheep samples. 

Additional file 3: Table 53. Chromosome wise information regarding 
genotyping data of cattle and sheep. 

Additional file 4: Figure 51. Manhattan plots of SNP-wise scores for 
each selection test statistics (A: CSS, B: Fst, C: XP-EHH, D: ADAF) for polled 
cattle (dataset A). Gray dots in the background show raw scores and blue 
and orange dots in the foreground show smooth scores, averaged over 
SNPs within 1 Mb sliding windows. Red dotted lines indicate a threshold 
of top 0.1 percentile of the genome-wide smoothed scores for each of 
the selection test statistics. Red SQuare dots in each plot show the 
genome-wide highest raw signals. 

Additional file 5: Figure 52. Manhattan plots of SNP-wise scores for 
each selection test statistics (A: CSS, B: Fst, C: XP-EHH, D: ADAF) for double 
muscle cattle (dataset B). Gray dots in the background show raw scores and 
blue and orange dots in the foreground show smooth scores, averaged 
over SNPs within 1 Mb sliding windows. Red dotted lines indicate a 
threshold of top 0.1 percentile of the genome-wide smoothed scores for 
each of the selection test statistics. Red square dots in each plot show the 
genome-wide highest raw signals. The square dots are in dark brown colour 
in B plot as the highest Fst signals is more than 3 Mb upstream from the 
known candidate region on chromosome 2. 

Additional file 6: Figure 53. Manhattan plots of SNP-wise scores for 
each selection test statistics (A: CSS, B: Fst, C: XP-EHH, D: ASAF) for polled 
sheep (dataset C). Gray dots in the background show raw scores and blue 
and orange dots in the foreground show smooth scores, averaged over 
SNPs within 1 Mb sliding windows. Red dotted lines indicate a threshold 
of top 0.1 percentile of the genome-wide smoothed scores for each of 
the selection test statistics. Red square dots in each plot show the 
genome-wide highest raw signals. 

Additional file 7: Figure 54. Manhattan plots of SNP-wise scores for 
each selection test statistics (A: CSS, B: Fsj, C: XP-EHH, D: ASAF) for double 
muscle sheep (dataset D). Gray dots in the background show raw scores 
and blue and orange dots in the foreground show smooth scores, averaged 
over SNPs within 1 Mb sliding windows. Red dotted lines indicate a 
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threshold of top 0.1 percentile of the ger^ome-wide smoothed scores for 
each of the selectior^ test statistics. Red square dots in each plot show the 
genome-wide highest raw signals. The square dots are in dark brown colour 
in C and D plots where the highest XPEHH and ASAF signals are more than 
3 Mb upstream and downstream, respectively, from the known candidate 
region on chromosome 2. 

Additional file 8: Figure S5. Distribution of the number of SNPs in 
1 Mb sliding windows in cattle (A) and sheep (B). Bars in A and B 
indicate the frequency of sliding windows containing various number of 
SNPs out of the genome-wide distribution, i.e., 38,610 SNPs of cattle and 
47,502 SNPs of sheep data, respectively (details in Table 1, S3). The bars in 
red (black) colours show the mean ~ median (mode) numbers as 17 (18) 
and 19 (20) of SNPs for cattle and sheep data, respectively. 

Additional file 9: Figure S6. Genome-wide pairs plots (lower 
diagonals), histograms (diagonals) and correlations (upper diagonals) for 
constituent (XP-EHH, ASAF, Fst) and composite selection signals (CSS) for 
polled (A), double muscle (B), cattle polledness (C) and double muscling 
(D) in sheep. 

Additional file 10: Table S4. Complete list of genomic regions and 
genes harbouring significant SNPs identified by four tests in four cohorts 
of cattle and sheep data. Cluster of minimum three significant SNPs 
within a window spanning 1 Mb genomic locations centred on a core 
SNP above the threshold (top 0.1%) in multiple tests (smoothed statistics) 
are reported and are compared with each other. 

Additional file 11: Figure S7. Histograms of Mean Z, raw p-value and 
calibrated p-values distributions of the CSS: Histograms (top to bottom in 
each column) for polled cattle (column 1, red), double muscle cattle 
(column 2, green), polled sheep (column 3, purple) and double muscle 
sheep (column 4, blue). 

Additional file 12: Figure S8. False discovery rate (FDR) against p-values: 

q-values were calculated from the calibrated p-values. Vertical dotted ( ) 

and dashed ( ) lines indicate calibrated p-values at 0.01 and 0.05, 

respectively. Horizontal dotted and dashed lines indicate q-values (FDR) 
at 0.05 and 0.1, respectively. 

Additional file 13: Table S5. Selection signatures in European and 
African Bos taurus cattle populations. Complete list of selection signatures 
identified by composite (CSS) and constituent (XPEHH, EST, ADAE) 
selection tests. 

Additional file 14: Figure S9. Genome-wide comparison of using SNP 
genotype data from all breeds (Total: 46 breeds and N = 847), breeds 
with minimum 10 samples (Total: 26 breeds and N = 753) and breeds 
with minimum 20 samples (Total: 20 breeds and N = 652) for computing 
CSS (A), XP-EHH (B), ADAE (C) and Fst (D) for European Bos taurus cattle. 

Additional file 15: Figure S10. Chromosome-wise comparison of 
average derived allele frequencies (A) and average minor allele frequencies 
(B) between European and African Bos taurus cattle breeds. 
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